Linguistic correlates of style: authorship classification with deep linguistic analysis features

Author

  • Michael Gamon
Abstract

The identification of authorship falls into the category of style classification, an interesting sub-field of text categorization that deals with properties of the form of linguistic expression as opposed to the content of a text. Various feature sets and classification methods have been proposed in the literature, geared towards abstracting away from the content of a text and focusing on its stylistic properties. We demonstrate that in a realistically difficult authorship attribution scenario, deep linguistic analysis features such as context-free production frequencies and semantic relationship frequencies achieve significant error reduction over more commonly used “shallow” features such as function word frequencies and part-of-speech trigrams. Modern machine learning techniques like support vector machines allow us to explore large feature vectors, combining these different feature sets to achieve high classification accuracy in style-based tasks.
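As a rough illustration of the two “shallow” feature sets named in the abstract, the sketch below computes function word frequencies and part-of-speech trigram frequencies and merges them into one feature dictionary. It is not the paper's implementation: the function word list `FUNCTION_WORDS`, the helper names, and the hand-supplied tag sequence are all assumptions made for this example, and the deep features (context-free production and semantic relationship frequencies) are not covered because they require a parser.

```python
from collections import Counter

# Hypothetical, deliberately tiny function word list (an assumption for
# illustration; a real study would use a much longer list).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it")


def function_word_frequencies(tokens):
    """Relative frequency of each listed function word among all tokens."""
    total = len(tokens)
    counts = Counter(t.lower() for t in tokens)
    return {f"fw:{w}": (counts[w] / total if total else 0.0)
            for w in FUNCTION_WORDS}


def pos_trigram_frequencies(tags):
    """Relative frequency of each part-of-speech trigram in a tag sequence."""
    trigrams = [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]
    total = len(trigrams)
    counts = Counter(trigrams)
    return {"pos:" + "_".join(tri): n / total for tri, n in counts.items()}


def combined_features(tokens, tags):
    """Merge both feature sets into a single sparse feature dictionary."""
    features = function_word_frequencies(tokens)
    features.update(pos_trigram_frequencies(tags))
    return features


# Toy input: the tags are written by hand here; in practice they would
# come from a part-of-speech tagger.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
features = combined_features(tokens, tags)
```

A dictionary like `features` could then be vectorized and passed to a classifier such as a support vector machine, which is the learner the abstract mentions; that step is left out to keep the sketch dependency-free.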

Similar resources

Authorship Verification via k-Nearest Neighbor Estimation Notebook for PAN at CLEF 2013

In this paper we describe our k-Nearest Neighbor (k-NN) based Authorship Verification method for the Author Identification (AI) task of the PAN 2013 challenge. The method follows an ensemble classification technique based on the combination of suitable feature categories. For each chosen feature category we apply a k-NN classifier to calculate a style deviation score between the training docume...

Linguistic Analysis of the Main Traits of Stream of Consciousness in the Persian Translations of Virginia Woolf's Mrs. Dalloway and James Joyce's A Portrait of the Artist as a Young Man

This study investigated how the main linguistic traits of stream of consciousness novels are realized in Persian translations and also the frequency of translation strategies used by translators. Accordingly, a restricted set of linguistic parameters which Totò (2014) asserts can show the stream of thought of character(s), is chosen including punctuation, exclamatory utterances, interjections, ...

A Study of the Relationship between Acoustic Features of “bæle” and the Paralinguistic Information

Language users benefit from special phonetic tools in order to communicate linguistic information as well as different emotional aspects and paralinguistic information through daily conversation. Having functions in conveying semantic information to listeners, prosodic features form the essential part of linguistic behaviour; manipulating them potentially can play an important role in transmitt...

Authorship Identification Using a Reduced Set of Linguistic Features

The proposed solution for authorship attribution combines a couple of the most important features identified in previous research in this domain with classification algorithms in order to detect the correct author. We consider that the most relevant aspect of our work is the small number of linguistic features and the use of the same framework to solve both the open and the closed class authors...

More Blogging Features for Author Identification

In this paper we present a novel improvement in authorship identification for personal blogs. The improvement comes from a hybrid collection of linguistic features that best capture the style of users in diary blogs. The feature sets contain LIWC with its psychology background, a collection of syntactic features & part-of-speech (PO...

Publication date: 2004